Multiple Classifier Combination through Ensembles and Data Generation
Author
Abstract
An ensemble of classifiers consists of a set of individually trained classifiers whose predictions are combined when classifying new instances. The resulting ensemble is generally more accurate than the individual classifiers it is built from. In particular, one of the most popular ensemble methods, the Boosting approach, improves the predictive performance of weak classifiers, i.e. classifiers whose accuracy is only slightly better than random guessing. This improvement is achieved by focusing on examples that are hard to classify correctly, the so-called hard examples. Boosting has been applied with great success to many real-world domains, including medicine, bioinformatics, fraud detection and text classification, amongst others.

However, in many domains, Boosting algorithms frequently suffer from overemphasizing these hard examples: the algorithm puts so much effort into the hard examples that both training and test set accuracies deteriorate. Moreover, the knowledge acquired from such hard examples may be insufficient to improve the overall accuracy of the ensemble. Also, when applied to two-class domains with imbalanced class frequencies, where the number of examples of one (majority) class is much higher than that of the other (minority) class, a traditional Boosting algorithm tends to produce high predictive accuracy on the majority class but poor accuracy on the minority class. This is due to a number of factors. Firstly, learning algorithms tend to ignore small classes while concentrating on classifying the large ones accurately (the so-called learning bias). Secondly, the weight-updating mechanism of Boosting algorithms may aggravate this learning bias. Thirdly, examples from the minority class may provide insufficient knowledge for learning. In addition, the few examples from the minority class are much more likely to be over-emphasized by the Boosting algorithm.

This thesis introduces new techniques to address these two problems. Two approaches, namely the DataBoost and DataBoost-IM algorithms, are provided to extend the predictive performance of Boosting algorithms. The DataBoost algorithm is designed to help Boosting algorithms avoid overemphasizing hard examples: new synthetic data, generated with a bias towards the hard examples, are added to the original training set when training the component classifiers. The DataBoost approach was evaluated on ten data sets, using both decision trees and neural networks as base classifiers. The experiments show promising results, in terms of overall accuracy, when compared to a standard benchmarking Boosting algorithm.
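To make the mechanism concrete, the following is a minimal illustrative sketch in Python of the two ingredients involved: a standard AdaBoost-style weight update, and a deliberately simplified, weight-biased synthetic data generator in the spirit of DataBoost. The Gaussian jitter and the function names are assumptions made here for illustration; this is not the exact generation procedure described in the thesis.

```python
import numpy as np

def adaboost_weight_update(weights, misclassified, alpha):
    """AdaBoost.M1-style re-weighting: misclassified (hard) examples
    gain weight, after which all weights are renormalised."""
    w = weights * np.exp(alpha * misclassified.astype(float))
    return w / w.sum()

def generate_biased_synthetic(X, y, weights, n_new, rng=None):
    """Simplified stand-in for DataBoost-style data generation:
    seed examples are sampled in proportion to their boosting weight
    (so hard examples dominate) and jittered with small Gaussian
    noise. The noise model is an assumption made here."""
    rng = rng or np.random.default_rng(0)
    p = weights / weights.sum()                   # hard examples are likelier seeds
    seeds = rng.choice(len(X), size=n_new, p=p)   # weight-proportional sampling
    noise = rng.normal(0.0, 0.05 * X.std(axis=0), size=(n_new, X.shape[1]))
    return X[seeds] + noise, y[seeds]             # synthetic points keep the seed's label
```

Because seed examples are drawn in proportion to their boosting weight, the synthetic points cluster around the hard examples while spreading the emphasis that would otherwise fall on a few individual instances.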
The DataBoost-IM algorithm is developed to learn from two-class imbalanced data sets. In the DataBoost-IM approach, the class frequencies and the total weights of the different classes within the ensemble's training set are rebalanced by adding new synthetic data. The DataBoost-IM method was evaluated, in terms of the F-measure, G-mean and overall accuracy, on seventeen highly and moderately imbalanced data sets, using decision trees as base classifiers. The experimental results show that the DataBoost-IM method compares favorably with a base classifier, a standard benchmarking Boosting algorithm and three advanced Boosting-based algorithms for imbalanced data sets. The results also indicate that the DataBoost-IM approach does not sacrifice one class in favor of the other, but achieves high predictive accuracy on both the minority and the majority classes.
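For reference, both evaluation measures can be computed directly from the two-class confusion counts. The self-contained sketch below (function and variable names chosen here for illustration) shows why the G-mean, the geometric mean of the per-class accuracies, stays low whenever either class is sacrificed.

```python
import numpy as np

def gmean_and_f1(y_true, y_pred, minority=1):
    """G-mean and F-measure for a two-class problem, computed from
    the confusion counts of the minority class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == minority) & (y_pred == minority))
    fn = np.sum((y_true == minority) & (y_pred != minority))
    fp = np.sum((y_true != minority) & (y_pred == minority))
    tn = np.sum((y_true != minority) & (y_pred != minority))
    recall = tp / (tp + fn) if tp + fn else 0.0        # minority-class accuracy
    specificity = tn / (tn + fp) if tn + fp else 0.0   # majority-class accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0
    gmean = (recall * specificity) ** 0.5              # geometric mean of per-class accuracies
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return gmean, f1
```

A classifier that labels everything as the majority class can score high overall accuracy on an imbalanced set, yet its G-mean is exactly zero, which is why these measures are reported alongside accuracy.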
Similar resources
A comparative study of classifier ensembles for bankruptcy prediction
The aim of bankruptcy prediction in the areas of data mining and machine learning is to develop an effective model that provides high prediction accuracy. In the prior literature, various classification techniques have been developed and studied, among which classifier ensembles, which combine multiple classifiers, have been shown to outperform many single classifiers. ...
Evolutionary Tuning of Combined Multiple Models
In data mining, hybrid intelligent systems present a synergistic combination of multiple approaches to develop the next generation of intelligent systems. Our paper presents an integration of a Combined Multiple Models (CMM) technique with an evolutionary approach that is used for parameter tuning. The proposed hybrid classifier was tested in the microarray analysis domain. This domain was chosen i...
Carnegie Mellon University Optimal Classifier Ensembles for Improved Biometric Verification
In practical biometric verification applications, we expect to observe a large variability of biometric data. Single classifiers have insufficient accuracy in such cases. Fusion of multiple classifiers is proposed to improve accuracy. Typically, classifier decisions are fused using a decision fusion rule. Usually, research is done on finding the best decision fusion rule, given the set of class...
Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination
Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...
An Experimental Study of a Self-Supervised Classifier Ensemble
Learning using labeled and unlabeled data has received a considerable amount of attention in the machine learning community due to its potential in reducing the need for expensive labeled data. In this work we present a new method for combining labeled and unlabeled data based on classifier ensembles. The model we propose assumes each classifier in the ensemble observes the input using different se...
An experimental study on diversity for bagging and boosting with linear classifiers
In classifier combination, it is believed that diverse ensembles have a better potential for improvement in accuracy than nondiverse ensembles. We put this hypothesis to the test for two methods for building the ensembles, Bagging and Boosting, with two linear classifier models: the nearest mean classifier and the pseudo-Fisher linear discriminant classifier. To estimate diversity, we apply n...